ABSTRACT

Text categorization is typically formulated as a concept learn- 
ing problem where each instance is a single isolated docu- 
ment. In this paper we are interested in a more general for- 
mulation where documents are organized as page sequences, 
as naturally occurring in digital libraries of scanned books 
and magazines. We describe a method for classifying pages 
of sequential OCR text documents into one of several as- 
signed categories and suggest that taking into account con- 
textual information provided by the whole page sequence 
can significantly improve classification accuracy. The pro- 
posed architecture relies on hidden Markov models whose 
emissions are bag-of-words according to a multinomial word 
event model, as in the generative portion of the Naive Bayes 
classifier. Our results on a collection of scanned journals 
from the Making of America project confirm the importance 
of using whole page sequences. Empirical evaluation indicates
that the error rate (as obtained by running a plain Naive Bayes
classifier on isolated pages) can be roughly reduced by half if
contextual information is incorporated.

1. INTRODUCTION

Text categorization is the problem of grouping textual 
documents into different fixed classes or categories. The 
task is related to the ability of an intelligent system to automatically
perform tasks such as personalized e-mail or news
filtering, document indexing, and metadata extraction. These
problems are of great and increasing importance, mainly 
because of the recent explosive increase of online textual in- 
formation. Text categorization is generally formulated in 
the machine learning framework. In this setting, a learning 
algorithm takes as input a set of labeled examples (where 
the label indicates which category the example document be- 
longs to) and attempts to infer a function that will map new 
documents into their categories. Several algorithms have 
been proposed within this framework, including regression 
models [29], inductive logic programming [6], probabilistic 
classifiers [17, 21, 16], decision trees [18], neural networks 
[22], and more recently support vector machines [12]. 

Research on text categorization has been mainly focused 
on non-structured documents. In the typical approach, in- 
herited from information retrieval, each document is rep- 
resented by a sequence of words, and the sequence itself is 
normally flattened down to a simplified representation called 
bag of words (BOW). This is like representing each document
as a feature vector, where features are words in the
vocabulary and components of the feature vector are statistics
such as word counts in the document. Although such
a simplified representation is appropriate for relatively flat 
documents (such as email and news messages), other types 
of documents are internally structured and this structure 
should be exploited in the representation to better inform 
the learner. 

In this paper we are interested in the domain of digital 
libraries and, in particular, collections of digitized books 
or magazines, with text extracted by an Optical Charac- 
ter Recognition (OCR) system. One important challenge 
for digital conversion projects is the management of struc- 
tural and descriptive metadata. Currently, metadata man- 
agement involves a large amount of keying work carried out 
by human operators. Automating the extraction of meta- 
data from digitized documents could greatly improve effi- 
ciency and productivity [1]. This automation, however, is 
not a trivial task and involves recognition of the ordering of 
text divisions, such as chapters and sub-chapters, the iden- 
tification of layout elements, such as headlines, footnotes, 
graphs, and captions, and the linking of articles within a
periodical. Automatic recognition of these elements can be a
hard problem, especially without any prior knowledge about 
the type of elements that are expected to be present within 
a given document page. Hence, page classification can rep- 
resent a useful preliminary step to guide the subsequent ex- 
traction process. Moreover, extracting metadata related to 
the semantic contents of document parts (such as chapters 
or articles) can require the ability of recognizing the topic or 
the category of these parts. The solution to these problems 
can be helped by a classifier that assigns a category to each 
page of the document. 

Unlike email or news articles, books and periodicals are 
multi-page documents and the simplest level of structure 
that can be exploited is the serial order relation defined 
among single pages. The task we consider is the automatic 
categorization of each page according to its (semantic) contents¹.
Exploiting the serial order relation among pages
within a single document can be expected to improve classi- 
fication accuracy when compared to a strategy that simply 
classifies each page separately. This is because the sequence 
of pages in documents such as books or magazines often 
follows regularities such as those implied by typographical 
and editorial conventions. Consider for example the do- 
main of books and suppose categories of interest include 
title-page, dedication-page, preface-page, index-page, 
table-of-contents, regular-page, and so on. Even in this 
very simplified case we can expect constraints about the 
valid sequences of page categories in a book. For example, 
title-page is very unlikely to follow index-page and, sim- 
ilarly, dedication-page is more likely to follow title-page 
than preface-page. Constraints of this type can be captured
and modeled using a stochastic grammar. Thus, information 
about the category of a given page can be gathered not only 
by examining the contents of that page, but also by examining
the contents of other pages in the sequence. Since contextual
information can significantly help to disambiguate
between page categories, we expect that classification accuracy
will improve if the learner has access to whole sequences
instead of single-page documents.

In this paper we combine several algorithmic ideas to solve 
the problem of text categorization in the domain of multi- 
page documents. First, we use an algorithm similar to those 
described in [28] and [20] for inducing a stochastic regular 
grammar over sequences of page categories. Second, we in- 
troduce a hidden Markov model (HMM) that can deal with 
sequences of BOWs. Each state in the HMM is associated 
with a unique page category. Emissions are modeled by a
multinomial distribution over word events, as in the generative
component of the Naive Bayes classifier. The HMM
is trained from (partially) labeled page sequences, i.e. state
variables are partially observed in the training set. Unobserved
states (which is the common setting in most classic
applications of HMMs) arise here when document pages are
partially unlabeled, as in the framework described in [23]
and [13]. Finally, we solve the categorization problem by
running the Viterbi algorithm on the trained HMM, yielding
a sequence of page categories associated with new (unseen)
documents. This is somewhat related to recent applications
of HMMs to information extraction [9, 20], but the output
labeling in our case is associated with the entire stream of
text contained in a page, while in [9, 20] the HMM is used
to attach labels to single words of shorter portions of text.

¹ A related formulation would consist of assigning a global
category to a whole multi-page document, but this formulation
is not considered in this paper.

Our approach is validated on a real dataset consisting 
of 95 issues of the journal American Missionary, which is
part of the “Making of America” collection [26]. In spite 
of text noise due to optical recognition, our system achieves 
about 85% page classification accuracy when training on 10 
issues (year 1884) and testing on issues from 1885 to 1893. 
More importantly, we show that incorporating contextual 
information significantly reduces classification error, both 
in the case of completely labeled example documents and 
when unlabeled documents are included in the training set. 

2. BACKGROUND 

Let d be a generic multi-page document, and let $d_t$ denote
the t-th page within the document. The categorization
task consists of learning from examples a function
$f : d_t \to \{c^1, \dots, c^K\}$ that maps each page $d_t$ into one out of
K classes.

2.1 The Naive Bayes classifier 

The above task can also be reformulated in probabilistic
terms as the estimation of the conditional probability
$P(C_t = c^k \mid d_t)$, $C_t$ being a multinomial class variable with
realizations in $\{c^1, \dots, c^K\}$. In so doing, f can be computed
using Bayes’ decision rule, i.e. $f(d_t)$ is the class with the highest
posterior probability. The Naive Bayes classifier computes
this probability as

$$P(C_t = c^k \mid d_t) \propto P(d_t \mid C_t = c^k)\, P(C_t = c^k). \qquad (1)$$

What characterizes the model is the so-called Naive Bayes
assumption, prescribing that word events (each occurrence
of a given word in the page corresponds to one event) are
conditionally independent given the page category. As a
result, the class conditional probabilities can be factorized
as

$$P(d_t \mid C_t = c^k) = \prod_{i=1}^{|d_t|} P(w_t^i \mid C_t = c^k) \qquad (2)$$

where $|d_t|$ denotes the length of page $d_t$ and $w_t^i$ is the i-th
word in the page. This conditional independence assumption
is graphically represented by the Bayesian network²
shown in Figure 1.

Although the basic assumption is clearly false in the real 
world, the model works well in practice since classification 
requires finding a good separation surface, not necessarily 
a very accurate model of the involved probability distributions.
Training consists of estimating the model’s parameters
from a dataset $\mathcal{D}$ of labeled documents (see, e.g. [21]).
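The decision rule of Equations 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `train_naive_bayes` and `classify` are hypothetical, add-one smoothing of the word probabilities is assumed, and words outside the training vocabulary are simply ignored.

```python
import math
from collections import Counter

def train_naive_bayes(pages, labels):
    """Estimate log priors and add-one smoothed log word probabilities.

    pages is a list of word lists, labels the matching list of categories.
    """
    vocab = sorted({w for page in pages for w in page})
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    for page, c in zip(pages, labels):
        word_counts[c].update(page)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_word = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_word[c] = {
            w: math.log((1 + word_counts[c][w]) / (len(vocab) + total))
            for w in vocab
        }
    return log_prior, log_word

def classify(page, log_prior, log_word):
    """Return the class maximizing log P(c) + sum_i log P(w_i | c)."""
    def score(c):
        # Assumption: out-of-vocabulary words contribute nothing.
        return log_prior[c] + sum(
            log_word[c][w] for w in page if w in log_word[c])
    return max(log_prior, key=score)
```

Working in log space avoids numerical underflow when the product in Equation 2 runs over the several hundred words of a page.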

2.2 Hidden Markov models 

HMMs were introduced several years ago as a tool for
probabilistic sequence modeling. The interest in this area
developed particularly in the Seventies, within the speech
recognition research community [25]. During the last years
a large number of variants and improvements over the standard
HMM have been proposed and applied. Undoubtedly,
Markovian modeling is now regarded as one of the most
significant state-of-the-art approaches for sequence learning.
Besides several applications in pattern recognition and
molecular biology, HMMs have also been applied to several
text-related tasks, including natural language modeling [5]
and, more recently, information retrieval and extraction [9,
20]. The recent view of the HMM as a particular case of
Bayesian networks [2, 19, 27] has helped the theoretical understanding
and the ability to conceive extensions to the
standard model in a sound and formally elegant framework.

² A Bayesian network is an annotated graph in which nodes
represent random variables and missing edges encode conditional
independence statements amongst these variables.
Given a particular state of knowledge, the semantics of a
Bayesian network determines whether collecting evidence
about a set of variables modifies one’s belief about some
other set of variables [24, 11].

Figure 1: Bayesian network for the Naive Bayes classifier.

Figure 2: Bayesian networks for standard HMMs.

An HMM describes two related discrete-time stochastic
processes. The first process pertains to hidden discrete state
variables, denoted $X_t$, forming a first-order Markov chain
and taking realizations on a finite alphabet $\{x^1, \dots, x^N\}$.
The second process pertains to observed variables or emissions,
denoted $D_t$. Starting from a given state at time
0 (or given an initial state distribution) the model probabilistically
transitions to a new state $X_1$ and correspondingly
emits observation $D_1$. The process is repeated recursively
until an end state is reached. Note that, as this
form of computation may suggest, HMMs are closely related
to stochastic regular grammars [5]. The Markov property
prescribes that $X_{t+1}$ is conditionally independent of
$X_1, \dots, X_{t-1}$ given $X_t$. Furthermore, it is assumed that
$D_t$ is independent of the rest given $X_t$. These two conditional
independence assumptions are graphically depicted
using the Bayesian network of Figure 2. As a result, an
HMM is fully specified by the following conditional probability
distributions³:

$$\begin{array}{ll} P(X_t \mid X_{t-1}) & \text{(transition distribution)} \\ P(D_t \mid X_t) & \text{(emission distribution)} \end{array} \qquad (3)$$

Since the process is stationary, the transition distribution
can be represented as a square probability matrix whose
entries are transition probabilities $P(X_t = x^i \mid X_{t-1} = x^j)$,
abbreviated as $P(x^i \mid x^j)$ in the following. In the classic literature,
emissions are restricted to symbols in a finite alphabet
or multivariate continuous variables [25]. As explained
in the next section, our model allows emissions to be
bag-of-words.
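As a worked consequence of the two conditional independence assumptions, the joint distribution of a state sequence and its emissions factorizes into one transition term and one emission term per time step. The identity below is the standard HMM factorization; writing $X_0$ for the given state at time 0 is a notational assumption made here.

$$P(X_1, \dots, X_T, D_1, \dots, D_T \mid X_0) = \prod_{t=1}^{T} P(X_t \mid X_{t-1})\, P(D_t \mid X_t)$$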



3. THE MULTI-PAGE CLASSIFIER 

We now turn to the description of our classifier for multi- 
page documents. This section presents the architecture and 
the algorithms for grammar extraction, training, and classi- 
fication. 

3.1 Architecture 

In our case, HMM emissions are associated with entire
pages of the document. Thus the realizations of the observation
$D_t$ are bag-of-words representing the text in the t-th
page of the document. Within our framework, states are
related to page categories by a deterministic function $\phi$
that maps state realizations into page categories. We assume
that $\phi$ is a surjection but not a bijection, i.e. that there are
more state realizations than categories. This enriches the
expressive power of the model, allowing different transition
behaviors for pages of the same class, depending on where
the page is actually encountered within the sequence. However,
if the page contents depend on the category but not
on the context of the category within the sequence⁴, the use
of multiple states per category may introduce too many free
parameters and it may be convenient to assume that

$$P(D_t \mid x^i) = P(D_t \mid x^j) = P(D_t \mid c^k) \quad \text{if } \phi(x^i) = \phi(x^j) = c^k. \qquad (4)$$

This assumption constrains emission parameters to be the
same for all the HMM states labeled by the same page category,
a form of parameter sharing that may help to reduce
overfitting. The emission distribution is then defined as for
the Naive Bayes classifier, i.e. for every observed page $d_t$

$$P(d_t \mid c^k) = \prod_{i=1}^{|d_t|} P(w_t^i \mid c^k) \qquad (5)$$

Therefore, the architecture can be graphically described as
the merging of the Bayesian networks for HMMs and Naive
Bayes, as shown in Figure 3. We remark that the state (and
hence the category) at page t depends not only on the contents
of the page, but also on the contents of other pages in
the document. This probabilistic dependency implements
the mechanism for taking contextual information into account.

³ We adopt the standard convention of denoting variables
by uppercase letters and realizations by the corresponding
lowercase letters. Moreover, we use the table notation for
probabilities as in [11]; for example $P(X)$ is a shorthand for
the table $[P(X = x^1), \dots, P(X = x^r)]$ and $P(X, Y \mid Z)$ denotes
the two-dimensional table with entries $P(X = x^i, Y = y^j \mid Z = z^k)$.

⁴ Of course this does not mean that the category is independent
of the context.

Figure 3: Bayesian network for the hybrid HMM Naive Bayes architecture.

The algorithms used in this paper are derived from the 
literature on Markov models [25], inference and learning in 
Bayesian networks [24, 11, 10], and classification with Naive 
Bayes [17, 15]. In the following we sketch the main issues 
related to the integration of all these methods. 

3.2 Induction of HMM topology 

The structure or topology of an HMM is a representation 
of the allowable transitions between hidden states. More 
precisely, the topology is described by a directed graph whose 
vertices are state realizations $\{x^1, \dots, x^N\}$, and whose edges
are the pairs $(x^j, x^i)$ such that $P(x^i \mid x^j) \neq 0$. An HMM is
said to be ergodic if its transition graph is fully connected.
However, in almost all interesting application domains, less 
connected structures are better suited for capturing the ob- 
served properties of the sequences being modeled, since they 
convey domain prior knowledge. Thus, starting from the 
right structure is an important problem in practical hid- 
den Markov modeling. As an example, consider Figure 4, 
showing a (very simplified) graph that describes transitions 
between the parts of a hypothetical set of books. Possible
state realizations are⁵ {start, title, dedication, preface, toc,
regular, index, end}. The structure indicates, among other
things, that only dedication, preface, or table of contents can
follow the title page. Self-loops indicate that a given category
can be repeated for several consecutive pages. While
a structure of this kind could be hand-crafted by a domain
expert, it may be more advantageous to learn it automatically
from data.

Figure 4: Example of HMM transition graph. [Diagram over the states start, title, dedication, preface, toc, regular, index, and end.]

We now briefly describe the solution adopted to automatically
infer HMM transition graphs from sample multi-page
documents. Let us assume that all the pages of the available
training documents are labeled with the class they belong
to. One can then take advantage of the observable
distribution of data to search for an effective structure
in the space of HMM topologies. Our approach is based
on the application of an algorithm for data-driven model
induction adapted from previous work on Bayesian HMM
induction [28] and construction of HMMs of text phrases
for information extraction [20]. The algorithm starts by
building a structure that is only capable of “explaining” the
available training sequences (a maximally specific model).
The initial structure includes as many paths (from the initial
state to the final one) as there are training sequences. Every
path is associated with one sequence of pages, i.e. a distinct
state is created for every page in the training set. Each
state x is labeled by $\phi(x)$, the category of the corresponding
page in the document. Note that, unlike the example
shown in Figure 4, several states are generated for the same
category. The algorithm then iteratively applies merging
heuristics that collapse states so as to augment generalization
capabilities over unseen sequences. The first heuristic,
called neighbor-merging, collapses two states x and x' if they
are neighbors in the graph and $\phi(x) = \phi(x')$. The second
heuristic, called V-merging, collapses two states x and x'
if $\phi(x) = \phi(x')$ and they share a transition from or to a
common state, thus reducing the branching factor of the
structure.

⁵ Note that in this simplified example $\phi$ is a one-to-one mapping.
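The construction of the maximally specific model and the two merging heuristics can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of [28] or [20]: the function name `induce_topology` is hypothetical, states are plain integers, merging is applied greedily until no pair of states qualifies, and no stopping criterion beyond that is modeled.

```python
def induce_topology(label_sequences):
    """Return (labels, edges) of a merged transition graph.

    labels maps a state id to its page category; edges is a set of
    (source, destination) state pairs. States 0 and 1 are start and end.
    """
    labels = {0: "start", 1: "end"}
    edges = set()
    next_id = 2
    # Maximally specific model: one fresh state per training page.
    for seq in label_sequences:
        prev = 0
        for category in seq:
            labels[next_id] = category
            edges.add((prev, next_id))
            prev = next_id
            next_id += 1
        edges.add((prev, 1))

    def merge(keep, drop):
        # Redirect every edge touching `drop` to `keep`.
        nonlocal edges
        edges = {(keep if a == drop else a, keep if b == drop else b)
                 for a, b in edges}
        del labels[drop]

    def find_pair():
        states = sorted(labels)
        for idx, a in enumerate(states):
            for b in states[idx + 1:]:
                if labels[a] != labels[b]:
                    continue
                # Neighbor-merging: same category and adjacent in the graph.
                if (a, b) in edges or (b, a) in edges:
                    return a, b
                # V-merging: same category and a shared predecessor
                # or a shared successor.
                preds_a = {s for s, d in edges if d == a}
                preds_b = {s for s, d in edges if d == b}
                succs_a = {d for s, d in edges if s == a}
                succs_b = {d for s, d in edges if s == b}
                if preds_a & preds_b or succs_a & succs_b:
                    return a, b
        return None

    # Each merge removes one state, so the loop terminates.
    while (pair := find_pair()) is not None:
        merge(*pair)
    return labels, edges
```

Neighbor-merging is what produces self-loops: two consecutive pages of the same category collapse into a single state with a transition to itself.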



3.3 Inference and learning 

Given the HMM topology extracted by the algorithm de- 
scribed above, the learning problem consists of determining 
transition and emission parameters. One important distinction
that needs to be made when training Bayesian networks
is whether or not all the variables are observed. Assuming
complete data (all variables observed), maximum likelihood
estimation of the parameters can be solved using a one-step
algorithm that collects sufficient statistics for each parameter
[10]. In our case, data are complete if and only if
the following two conditions are met:

1. there is a one-to-one mapping between HMM states
and page categories (i.e. N = K and for k = 1, ..., K,
$\phi(x^k) = c^k$), and

2. the category is known for each page in the training documents,
i.e. the dataset consists of sequences of pairs
$(\{d_1, c_1^*\}, \dots, \{d_T, c_T^*\})$, $c_t^*$ being the (known) category
of page t and T being the number of pages in the document.



Under these assumptions, estimation of transition parameters
is straightforward and can be accomplished as follows:

$$P(c^i \mid c^j) = \frac{N(c^i, c^j)}{\sum_{l=1}^{K} N(c^l, c^j)} \qquad (6)$$

where $N(c^i, c^j)$ is the number of times a page of class $c^i$
follows a page of class $c^j$ in the training set. Similarly, estimation
of emission parameters in this case would be accomplished
exactly as in the case of the Naive Bayes classifier
(see, e.g. [21]):

$$P(w^l \mid c^k) = \frac{1 + N(w^l, c^k)}{|V| + \sum_{j=1}^{|V|} N(w^j, c^k)} \qquad (7)$$

where $N(w^l, c^k)$ is the number of occurrences of word $w^l$
in pages of class $c^k$ and $|V|$ is the vocabulary size ($1/|V|$
corresponds to a Dirichlet prior over the parameters and
plays a regularization role for those words which are very
rare within a class).
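For the complete-data case, Equations 6 and 7 amount to counting and normalizing. The following is a minimal sketch under the assumption that each training document is given as a list of (words, category) pairs; the function name `estimate_parameters` and the dictionary-based layout are hypothetical.

```python
from collections import Counter

def estimate_parameters(sequences, vocab):
    """Count-based estimates of transition and emission parameters.

    sequences is a list of documents, each a list of (words, category)
    pairs in page order; vocab is the set of retained words.
    """
    trans = Counter()   # (next_cat, prev_cat) -> N(c^i, c^j) in Equation 6
    words = Counter()   # (word, cat) -> N(w^l, c^k) in Equation 7
    for doc in sequences:
        cats = [c for _, c in doc]
        for prev, nxt in zip(cats, cats[1:]):
            trans[(nxt, prev)] += 1
        for page_words, c in doc:
            for w in page_words:
                if w in vocab:
                    words[(w, c)] += 1
    categories = sorted({c for doc in sequences for _, c in doc})

    p_trans = {}
    for prev in categories:
        total = sum(trans[(nxt, prev)] for nxt in categories)
        for nxt in categories:
            # A category never followed by another page has no estimate.
            if total:
                p_trans[(nxt, prev)] = trans[(nxt, prev)] / total

    p_word = {}
    for c in categories:
        total = sum(words[(w, c)] for w in vocab)
        for w in vocab:
            p_word[(w, c)] = (1 + words[(w, c)]) / (len(vocab) + total)
    return p_trans, p_word
```

The keys follow the order of the conditional probabilities in the text: `p_trans[(ci, cj)]` estimates $P(c^i \mid c^j)$ and `p_word[(w, c)]` estimates $P(w \mid c)$.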

Conditions 1 and 2 above, however, are normally not satisfied.
First, in order to model more accurately different
contexts in which a category may occur, it may be convenient
to have multiple distinct HMM states for the same
page category. Second, labeling pages in the training set
is a time-consuming process that needs to be performed by
hand and it may be important to also use unlabeled documents
for training [13, 23]. This means that label $c_t^*$ may
not be available for some t. If assumption 2 is satisfied but assumption
1 is not, we can derive the following approximated
estimation formula for transition parameters:

$$P(x^i \mid x^j) = \frac{N(x^i, x^j)}{\sum_{l=1}^{N} N(x^l, x^j)} \qquad (8)$$

where $N(x^i, x^j)$ counts how many times state $x^i$ follows $x^j$
during the state merge procedure described in Section 3.2.
However, in general, the presence of hidden variables requires
an iterative maximum likelihood estimation algorithm,
such as gradient ascent or expectation-maximization (EM).
Our implementation uses the EM algorithm, originally formulated
in [7] and usable for any Bayesian network with
local conditional probability distributions belonging to the
exponential family [10]. Here the EM algorithm essentially
reduces to the Baum-Welch form [25] with the only modification
that some evidence is entered into state variables.
State evidence is taken into account in the E-step by changing
forward propagation as follows:

$$\alpha_t(j) = \begin{cases} 0 & \text{if } \phi(x^j) \neq c_t^* \\ \sum_{i=1}^{N} \alpha_{t-1}(i)\, P(x^j \mid x^i)\, P(d_t \mid x^j) & \text{otherwise} \end{cases} \qquad (9)$$

where $\alpha_t(i) = P(d_1, \dots, d_t, X_t = x^i)$ is the forward variable
in the Baum-Welch algorithm.
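The modified forward recursion of Equation 9 can be sketched as below. Several details are assumptions made for the illustration: the function name `forward`, the use of an explicit initial state distribution for the first page, the list-of-lists layout, and the convention that an unlabeled page is marked with `None` so that the zeroing case applies only where a label is known.

```python
def forward(trans, emit, init, state_label, page_labels):
    """Forward pass with label evidence entered into the state variables.

    trans[i][j] = P(x^j | x^i), emit[t][j] = P(d_t | x^j),
    init[j] = P(X_1 = x^j), state_label[j] = phi(x^j),
    page_labels[t] is the known category of page t, or None if unlabeled.
    """
    n = len(init)

    def allowed(t, j):
        # A state is ruled out only when the page label is known and differs.
        return page_labels[t] is None or state_label[j] == page_labels[t]

    alpha = [[init[j] * emit[0][j] if allowed(0, j) else 0.0
              for j in range(n)]]
    for t in range(1, len(emit)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * trans[i][j] for i in range(n)) * emit[t][j]
            if allowed(t, j) else 0.0
            for j in range(n)
        ])
    return alpha
```

The sketch omits the rescaling of the forward variables that a practical implementation would need to avoid underflow on long page sequences.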

The M-step is performed in the standard way for tran- 
sition parameters, by replacing counts in Equation 6 with 
their expectations given all the observed variables. Emission 
probabilities are also estimated using expected word counts. 
If parameters are shared as indicated in Equation 4, these 
counts should be summed over states having the same label. 
Thus in the case of incomplete data, Equation 7 is replaced
by

$$P(w^l \mid c^k) = \frac{S + \sum_{p=1}^{S} \sum_{t} N(w^l, c^k) \sum_{i:\phi(x^i)=c^k} P(x^i \mid d_t)}{S|V| + \sum_{j=1}^{|V|} \sum_{p=1}^{S} \sum_{t} N(w^j, c^k) \sum_{i:\phi(x^i)=c^k} P(x^i \mid d_t)}$$

where S is the number of training sequences, $N(w^l, c^k)$ is the
number of occurrences of word $w^l$ in pages of class $c^k$, and
$P(x^i \mid d_t)$ is computed by the Baum-Welch procedure during
the E-step. The sum on p extends over training sequences,
while the sum on t extends over pages of the p-th document
in the training set. The E- and M-steps are iterated until
a local maximum of the (incomplete) data likelihood is
reached.



It is interesting to point out a related application of the 
EM algorithm for learning from labeled and unlabeled documents
[23]. In that paper the only concern was to allow
the learner to take advantage of unlabeled documents in 
the training set. As a major difference, the method in [23] 
assumes flat single-page documents and, if applied to multi- 
page documents, would be equivalent to a zero-order Markov 
model that cannot take into account contextual information. 

3.4 Page classification 

Given a document of T pages, classification is performed
by first computing the sequence of states $x_1, x_2, \dots, x_T$ that
was most likely to have generated the observed sequence of
pages, and then mapping each state to the corresponding
category $\phi(x_t)$. The most likely state sequence can be obtained
by running an adapted version of Viterbi’s algorithm,
whose more general form is the max-propagation
algorithm for Bayesian networks described in [11].
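The classification step can be illustrated with a standard log-space Viterbi decoder followed by the state-to-category mapping. This is a sketch of the textbook algorithm, not of the adapted version or of max-propagation as described in [11]; the function name `viterbi`, the initial distribution, and the matrix layout are assumptions.

```python
import math

def viterbi(log_trans, log_emit, log_init, state_label):
    """Most likely state sequence, mapped to page categories.

    log_trans[i][j] = log P(x^j | x^i), log_emit[t][j] = log P(d_t | x^j),
    log_init[j] = log P(X_1 = x^j); -math.inf encodes a forbidden move.
    """
    n = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n)]
    back = []
    for t in range(1, len(log_emit)):
        pointers, new_score = [], []
        for j in range(n):
            # Best predecessor of state j at page t.
            best = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][j] + log_emit[t][j])
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [state_label[s] for s in path]
```

When a transition is forbidden by the topology, a page whose words alone favor one category can still be assigned another, which is how the sequence context enters the decision.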

3.5 Feature selection 

Text pages should first be preprocessed with common information
retrieval techniques, including stemming and stop-word
removal. Still, the bag-of-words representation of
pages can lead to a very high-dimensional feature space corresponding
to the vocabulary extracted from training documents.
A high-dimensional feature space, especially in this
case where features are noisy because of OCR errors, may
lead to the overfitting phenomenon: the learner has very
high accuracy on the training set but generalization to new
examples is poor. Feature selection is a technique for limiting
overfitting by removing non-informative words from
documents. In our experiments we performed feature selection
using information gain [30]. This criterion is often
employed in different machine learning contexts. It measures
the average number of bits of information about the
category that are gained by including a word in a document.
For each dictionary term w, the gain is defined as

$$G(w) = -\sum_{k=1}^{K} P(c^k) \log_2 P(c^k) + P(w) \sum_{k=1}^{K} P(c^k \mid w) \log_2 P(c^k \mid w) + P(\bar{w}) \sum_{k=1}^{K} P(c^k \mid \bar{w}) \log_2 P(c^k \mid \bar{w})$$

where $\bar{w}$ denotes the absence of word w. Feature selection
is performed by retaining only the words having the highest
average mutual information with the class variable. OCR
errors, however, can produce very noisy features which may
be responsible for poor performance even if feature selection
is performed. For this reason, it may be convenient to prune
from the dictionary (before applying the information gain
criterion) all the words occurring in the training set with a
frequency below a given threshold h.
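The information gain criterion can be computed directly from labeled pages. The sketch below assumes that the probabilities are estimated from page-level presence or absence of the word, which the text does not specify; the function name `information_gain` is hypothetical.

```python
import math

def information_gain(pages, labels, word):
    """Information gain of `word`, estimated from page-level presence.

    pages is a list of word collections (e.g. sets), labels the matching
    list of categories.
    """
    n = len(pages)
    classes = sorted(set(labels))

    def plogp_sum(subset_labels):
        # sum_k P(c^k | subset) * log2 P(c^k | subset); zero terms skipped.
        total = len(subset_labels)
        out = 0.0
        for c in classes:
            p = subset_labels.count(c) / total if total else 0.0
            if p > 0:
                out += p * math.log2(p)
        return out

    with_w = [c for page, c in zip(pages, labels) if word in page]
    without_w = [c for page, c in zip(pages, labels) if word not in page]
    p_w = len(with_w) / n
    return (-plogp_sum(list(labels))
            + p_w * plogp_sum(with_w)
            + (1 - p_w) * plogp_sum(without_w))
```

A word that perfectly separates two equally frequent categories gains one bit, while a word present on every page gains nothing; selecting features then reduces to keeping the words whose gain exceeds a chosen threshold.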

3.6 Learning with labeled and unlabeled pages 

Creating a training set for text categorization involves
hand labeling in order to assign a category to each document.
Since this is an expensive human activity, it is interesting
to evaluate a classification system when only a fraction
of the training document pages are labeled, while other
documents are used without a category label. Clearly, unlabeled
documents are available at very low cost. In the case
of isolated page classification, previous research has demonstrated
that learners such as Naive Bayes and support vector
machines can take advantage of the inclusion in the training
set of documents whose class is unknown [13, 23]. In particular,
the method presented in [23] uses EM to deal with
unobserved labels.

In the case of multi-page documents, the presence of miss- 
ing labels means that some pages of the training document 
sequences have no assigned category. The architecture in- 
troduced in this paper (see Figure 3) can easily handle the 
presence of unlabeled pages in the training set. Basically, 
evidence is entered into the states of the HMM chain only 
for those pages for which a label is known, while other state 
variables are left unobserved. The belief propagation algo- 
rithm is in charge of computing probabilities for these hidden 
variables. 

However, the structure learning algorithm presented in
Section 3.2 cannot be applied in the case of partially labeled
documents. Instead, it is possible to use ergodic (fully
connected) HMMs and derive a transition structure by
pruning, after the learning phase, those transitions having
small probabilities with respect to an assigned threshold. In
this way, we let EM derive a specific structure for the model
(note that the only alternative in the case of partially labeled
documents would be to obtain a transition graph from
a domain expert).
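Pruning a trained ergodic transition matrix can be sketched in a few lines. Renormalizing each row after pruning, and leaving a row untouched when all of its entries fall below the threshold, are assumptions added here; the text only states that low-probability transitions are removed. The function name `prune_transitions` is hypothetical.

```python
def prune_transitions(trans, threshold):
    """Zero out transitions below `threshold` and renormalize each row.

    trans[i][j] is the learned probability of moving from state i to j.
    """
    pruned = []
    for row in trans:
        kept = [p if p >= threshold else 0.0 for p in row]
        total = sum(kept)
        # Assumption: if every entry fell below the threshold, keep the row.
        pruned.append([p / total for p in kept] if total > 0 else list(row))
    return pruned
```

The nonzero entries of the result define the edges of the derived transition graph.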

4. EXPERIMENTAL RESULTS 

A preliminary evaluation of our system has been con- 
ducted in a digital library domain where data are naturally 
organized in the form of page sequences. The main purpose 
of our experiments was to make a comparison between our 
multi-page classification approach and a traditional isolated 
page classification system. 

4.1 Data Set 

We have chosen to evaluate the model over a subset of
the Making of America (MOA) collection, a joint project
between the University of Michigan and Cornell University
(see moa.umdl.umich.edu/about.html and [26]) for collecting
and making available digitized books and periodicals
about history and evolution processes of American society
between the XIX and XX centuries. Presently, the whole
archive contains electronic versions of important magazines
of the XIX century. In our experiments, we selected a subset
of the journal American Missionary (AMis), a sociological
magazine with strong Christian guidelines. The task
consists of correctly classifying pages of previously unseen
documents into one of the ten categories described in Table 1.
Most of these categories are related to the topic of
the articles, but some are related to the parts of the journal
(i.e. Contents, Receipts, and Advertisements). The dataset
we selected contains 95 issues from 1884 to 1893, for a total
of 3222 OCR text pages. Special issues and final report
issues (typically November and December issues) have been
removed from the dataset as they contain categories not
found in the rest. The first year was selected as training
set (10 training sequences, 342 pages). The remaining documents
(from 1885 to 1893, for a total of 2880 pages) were
used as a test set. The ten categories are temporally stable
over the 1883-1893 time period.



Name                         Description
1. Contents                  Cover and index of surveys
2. Editorial                 Editorial articles
3. The South                 Afro-Americans’ survey
4. The Indians               American Indians’ survey
5. The Chinese               Reports from China missions
6. Bureau of Women’s Work    Female conditions
7. Children’s Page           Education and childhood
8. Communications            Magazine information
9. Receipts                  Lists of founders
10. Advertisements           Contents are mostly graphic

Table 1: Categories in the American Missionary domain.

Category labels were obtained semi-automatically, start- 
ing from the MOA XML files supplied with the documents 
collection. The assigned category was then manually checked. 
In the case of pages containing the end and the beginning of 
two articles belonging to different categories, the page was 
assigned the category of the ending article. 

Each page within a document is represented as a bag of
words, obtained by counting the number of occurrences of
each word within the page. It is worth remarking that in this
application instances are text documents obtained by an OCR
process. Imperfections of the recognition algorithm and the
presence of images in some pages yield noisy text, containing
misspelled or nonexistent words and trash characters (see [3]
for a report on OCR accuracy in the MOA digital library).
Although these errors may negatively affect the learning
process and subsequent results in the evaluation phase, we
made no attempt to correct or filter out misspelled words,
except for the feature selection process described above.
However, since OCR-extracted documents preserve the text
layout found in the original image, it was necessary to rejoin
words that had been hyphenated due to line breaking.
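
The preprocessing just described can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function names, the regular expressions, and the choice to lowercase and keep only alphabetic tokens are all assumptions made for the example.

```python
import re
from collections import Counter


def rejoin_hyphenated(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    # "mis-\nsion" -> "mission". Assumption: every hyphen followed by a
    # line break is a line-break hyphen, so a genuine compound that
    # happens to wrap (e.g. "bag-of-\nwords") is merged incorrectly.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


def bag_of_words(page_text: str) -> Counter:
    """Count lowercase alphabetic tokens on a single page."""
    cleaned = rejoin_hyphenated(page_text).lower()
    return Counter(re.findall(r"[a-z]+", cleaned))


# Hypothetical page text, invented for the example.
page = "Reports from the mis-\nsion schools; the mission grows."
counts = bag_of_words(page)
```

Without the rejoining step the fragments "mis" and "sion" would be counted as two separate, nonexistent words, which is the kind of noise the paragraph above refers to.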

4.2 Feature selection and isolated page classification

The purpose of the experiments in this section is to
investigate the effects of feature selection and to assess the
baseline prediction accuracy that can be attained using the
Naive Bayes classifier on isolated pages. In a set of
preliminary evaluations we found that the best performance is
achieved by pruning words with fewer than h = 10 occurrences
and then selecting an optimal set of informative words. We
performed several tests by changing the information gain
threshold that determines whether a word is sufficiently
informative (see Section 3.5), resulting in different
vocabulary sizes with different prediction accuracies. For
each reduced vocabulary size we ran the Naive Bayes classifier
on isolated pages. Results are shown in Figure 5. Vocabulary
size ranges from 15635 words (no feature selection), yielding
65.07% classification accuracy, to 25 words, yielding 53.16%
accuracy. The optimal vocabulary size is 297 words, obtained
with a gain threshold of 0.089, yielding the best test-set
accuracy of 72.57%. This result was taken as the baseline for
comparing our model with the Naive Bayes classifier.
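
The two-stage selection (frequency pruning followed by an information gain threshold) can be illustrated with the sketch below. It assumes the textbook definition of information gain for the binary event "the word occurs on the page", measured in bits; the exact criterion is the one given in Section 3.5 and may differ, for example in the logarithm base. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(pages, labels, word):
    """Gain of the event 'word occurs on the page' about the label."""
    with_w = Counter(l for p, l in zip(pages, labels) if word in p)
    without_w = Counter(l for p, l in zip(pages, labels) if word not in p)
    n = len(pages)
    n_with = sum(with_w.values())
    conditional = (n_with / n) * entropy(with_w.values()) \
        + ((n - n_with) / n) * entropy(without_w.values())
    return entropy(Counter(labels).values()) - conditional


def select_vocabulary(pages, labels, min_count=10, threshold=0.089):
    """Keep words with at least min_count occurrences and enough gain."""
    totals = Counter()
    for p in pages:
        totals.update(p)
    frequent = [w for w, c in totals.items() if c >= min_count]
    return sorted(w for w in frequent
                  if information_gain(pages, labels, w) >= threshold)


# Toy data, invented for the example (pages are bags of words).
pages = [Counter({"receipts": 3, "the": 1}), Counter({"receipts": 2, "the": 2}),
         Counter({"school": 4, "the": 1}), Counter({"school": 1, "the": 3})]
labels = ["Receipts", "Receipts", "Editorial", "Editorial"]
vocab = select_vocabulary(pages, labels, min_count=2, threshold=0.5)
```

In the toy data the word "the" appears on every page, so it carries no information about the category and is dropped, while "receipts" and "school" each separate the two categories perfectly and are kept.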

4.3 Sequential page classification 

Using the hybrid model presented in Section 3, documents 
can be organized into ordered sequences of pages. The train- 
ing set contains 10 sequences (monthly issues) of the same 
[Plot: naive Bayes prediction accuracy versus vocabulary
dimension.]

Figure 5: Naive Bayes accuracy as a function of vocabulary
size (information gain criterion). Optimal vocabulary size is
297 words.



Category          Sequential   Isolated   Error red.
----------------  ----------   --------   ----------
Contents          100          100         0%
Editorial         80.9         63.11      48.2%
The South         90.81        71.84      67.4%
The Indians       61.07        44.3       30.1%
The Chinese       69.93        60.78      23.1%
Bureau W.W.       74.73        66.3       25.0%
Children’s Page   78.26        45.65      60.0%
Communications    93.55        92.47      14.3%
Receipts          98.31        98.31       0%
Advertisements    90.7         62.79      75.0%
Total Accuracy    85.28        72.57      46.3%

Table 2: Isolated classification (using the best Naive
Bayes) vs. sequential classification (using the hybrid
HMM with model merging).



342 documents for year 1884, while the test set is organized
into 85 sequences for a total of 2880 documents from year
1885 to 1893. The bag-of-words representation of pages fed
into the HMM classifier was identical to that previously used
with Naive Bayes (including preprocessing and feature
selection with a vocabulary of 297 words). We considered
two settings for validating the system. In the first setting,
it is assumed that category labels c_j are available for all
the pages in the training set. In the second setting, some
category labels are held out and training uses both labeled
and unlabeled pages.
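
One common way to let EM use both labeled and unlabeled pages is to clamp the E-step: states that contradict a known label get zero emission likelihood, so the forward-backward posteriors are certain on labeled pages and soft on unlabeled ones. The sketch below illustrates that idea under the simplifying assumption of one state per category; it is not necessarily the exact procedure of Section 3, and all names and numbers are hypothetical.

```python
def clamped_posteriors(emis, init, trans, labels):
    """State posteriors for one page sequence with partial labels.

    emis[t][s]: emission likelihood of page t under state s
    labels[t]:  index of the known state for page t, or None if unlabeled
    """
    n, k = len(emis), len(init)
    # Zero out the states that contradict a known label.
    b = [[emis[t][s] if labels[t] in (None, s) else 0.0 for s in range(k)]
         for t in range(n)]
    # Forward pass, normalized at every step to avoid underflow.
    alpha = []
    for t in range(n):
        if t == 0:
            raw = [init[s] * b[0][s] for s in range(k)]
        else:
            raw = [b[t][s] * sum(alpha[-1][r] * trans[r][s] for r in range(k))
                   for s in range(k)]
        z = sum(raw)  # zero only if labels contradict the transitions
        alpha.append([x / z for x in raw])
    # Backward pass, rescaled the same way.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        raw = [sum(trans[s][r] * b[t + 1][r] * beta[t + 1][r] for r in range(k))
               for s in range(k)]
        z = sum(raw)
        beta[t] = [x / z for x in raw]
    # Posterior is proportional to alpha * beta; renormalize per page.
    gamma = []
    for t in range(n):
        g = [alpha[t][s] * beta[t][s] for s in range(k)]
        z = sum(g)
        gamma.append([x / z for x in g])
    return gamma


# Toy example: uninformative emissions, first page labeled as state 0.
emis = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
gamma = clamped_posteriors(emis, init=[0.5, 0.5],
                           trans=[[0.9, 0.1], [0.1, 0.9]],
                           labels=[0, None, None])
```

In the toy run the single label propagates through the transition probabilities: the unlabeled second and third pages receive posteriors 0.9 and 0.82 for state 0 even though their emissions say nothing about the category.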

4.3.1 Completely labeled documents 
In the case of completely labeled documents it is possible 
to run the structure learning algorithm presented in Sec- 
tion 3.2. Figure 6 reports the structure learned from the 
10 training issues. Each vertex in the transition graph is 
associated with one HMM state and is labeled with the cor- 
responding category (see Table 1). Edges are labeled with 
the transition probability from source to target state, com- 
puted by counting state transitions during the state merg- 
ing procedure (see Equation 8). The associated stochastic 




Figure 6: Data-induced HMM topology for American
Missionary, year 1884.
grammar implies that valid AMis sequences ought to start
with the index page (class “Contents”), followed by a page
of general communications. The next state is associated with
a page of an editorial article. The self-transition here has a
value of 0.91, meaning that with high probability the next
page will belong to the editorial too. With lower probability
(0.07) the next page belongs to “The South” survey or
(prob. 0.008) to “The Indians” or “Bureau of Women’s Work”.
Continuing in this way we can associate a probability with
each string of page categories. Since our purpose is to
predict the correct string of categories, a good grammar
helps filter out classification hypotheses that generate low
(or zero) probability strings. Note that under the parameter
sharing assumption (see Equation 4), once the HMM structure
is given, an estimate of the emission probabilities can be
obtained using Equation 7. These values can be plugged in
as initial emission parameters for the EM algorithm.
Classification is finally performed by computing the most
likely state sequence.
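
The decoding step can be illustrated with a small Viterbi sketch over bag-of-words pages. It assumes one state per category and multinomial emissions scored as in Naive Bayes, whereas the learned structure may contain several states per category; the three-state grammar, the word probabilities, and the page counts are invented for the example.

```python
import math

NEG_INF = float("-inf")


def log_emission(counts, word_logprob):
    """Multinomial log-likelihood of a bag of words (constant dropped)."""
    return sum(n * word_logprob[w] for w, n in counts.items()
               if w in word_logprob)


def viterbi(pages, states, log_init, log_trans, word_logprobs):
    """Most likely state sequence for a sequence of bag-of-words pages."""
    # delta[s]: best log-score of any state path ending in s.
    delta = {s: log_init.get(s, NEG_INF)
             + log_emission(pages[0], word_logprobs[s]) for s in states}
    backptrs = []
    for page in pages[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: delta[r] + log_trans[r].get(s, NEG_INF))
            ptr[s] = prev
            new_delta[s] = (delta[prev] + log_trans[prev].get(s, NEG_INF)
                            + log_emission(page, word_logprobs[s]))
        delta = new_delta
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]


states = ["Contents", "Editorial", "Receipts"]
probs = {"Contents": {"index": 0.8, "mission": 0.1, "dollars": 0.1},
         "Editorial": {"index": 0.1, "mission": 0.8, "dollars": 0.1},
         "Receipts": {"index": 0.1, "mission": 0.1, "dollars": 0.8}}
word_logprobs = {s: {w: math.log(p) for w, p in d.items()}
                 for s, d in probs.items()}
log_init = {"Contents": 0.0}  # every issue starts with the index page
log_trans = {"Contents": {"Editorial": 0.0},
             "Editorial": {"Editorial": math.log(0.9),
                           "Receipts": math.log(0.1)},
             "Receipts": {"Receipts": 0.0}}
pages = [{"index": 3}, {"mission": 2, "dollars": 1},
         {"index": 2, "mission": 1}, {"dollars": 4}]

path = viterbi(pages, states, log_init, log_trans, word_logprobs)
# Isolated (Naive Bayes style) decision for the third page alone.
isolated = max(states, key=lambda s: log_emission(pages[2], word_logprobs[s]))
```

Taken alone, the third page looks most like "Contents", but the grammar has no transition back into that state, so the decoded sequence assigns it to "Editorial". This is the filtering effect of the grammar described above.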

Table 2 summarizes classification results on the test-set
document sequences, after a training phase applied both to
Naive Bayes and to our hybrid model. We report prediction
accuracy on single classes and average accuracy over all
text documents. Comparison is made with respect to the best
isolated-page classifier. The hybrid HMM classifier
(performing sequential classification) achieves 85.28%
accuracy and outperforms the plain Naive Bayes classifier
working on isolated pages on every category except
“Contents” and “Receipts”, where the two are tied. The
relative error reduction is about 46%, i.e. roughly half of
the errors are recovered thanks to contextual information. In
particular, it is interesting to note the large error
reduction for the category “Advertisements.” Pages in this
category typically contain several images and few words of
text. The isolated page classifier is subject to prediction
errors in this case since parameter estimation for rarely
occurring words can be poor. On the other hand, the
constraints imposed by the grammar make it possible to
recover many prediction errors, since advertisements
normally occur near the end of each issue.
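
The error reduction figures follow from the two accuracy columns of Table 2. As a quick check, the sketch below computes the fraction of baseline errors that the sequential classifier removes; the function name is ours, and the convention of returning zero when the baseline makes no errors is an assumption.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Percentage of the baseline's errors removed (accuracies in %)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    if baseline_err == 0:
        return 0.0  # nothing to recover when the baseline is perfect
    return 100.0 * (baseline_err - new_err) / baseline_err


# Accuracies taken from Table 2 (isolated first, sequential second).
total = relative_error_reduction(72.57, 85.28)
adverts = relative_error_reduction(62.79, 90.7)
```

For the whole test set this gives (27.43 - 14.72) / 27.43, about 46.3%, and for "Advertisements" (37.21 - 9.3) / 37.21, about 75.0%.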

In Figure 7 we report the classification performance of the
hybrid model on single issues of the journal. The graph is
to be interpreted as the temporal trend of the classifier
from 1885 to 1893. The accuracy dips correspond to test
issues with more than 70 pages, a significant deviation from
the average number of pages per issue (about 32). Values
range from a minimum of 50% to a maximum of 97.09%, with a
standard deviation of 10.41. To visualize a smoother trend,
we calculated a running average over a temporal window of 10
months, which shows a clearly superior trend over standard
Naive Bayes.
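
The smoothing can be sketched as follows, assuming a trailing window in which each point averages the current issue with the issues before it. Whether the original curve used a trailing or a centered window is not stated, and the accuracy values below are made up.

```python
def running_average(values, window=10):
    """Trailing mean over at most `window` most recent values."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical per-issue accuracies, not taken from Figure 7.
per_issue = [50.0, 100.0, 75.0]
trend = running_average(per_issue, window=2)
```

With a window of 2 the toy series becomes 50.0, 75.0, 87.5; the first point averages a single value because no earlier issue exists.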

4.3.2 Partially labeled documents

We performed six different experiments, each with a
different percentage of labeled documents. In this case the
structure learning algorithm cannot be applied and we used
ergodic HMMs with ten states (one per class). After training,
transitions with probability < 10^-3 were pruned. In one of
the six experiments we used all the available page labels
with an ergodic HMM. This experiment provides a basis for
evaluating the benefits of the structure learning algorithm
presented in Section 3.2.
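
The ergodic starting point and the pruning step can be sketched as below. The uniform initialization and the renormalization of each row after pruning are assumptions made for the example; the text only states that transitions with probability below 10^-3 were removed.

```python
def ergodic_transitions(states):
    """Fully connected transition table with uniform probabilities."""
    p = 1.0 / len(states)
    return {s: {t: p for t in states} for s in states}


def prune_transitions(trans, threshold=1e-3):
    """Drop transitions below `threshold`, then renormalize each row."""
    pruned = {}
    for s, row in trans.items():
        kept = {t: p for t, p in row.items() if p >= threshold}
        total = sum(kept.values())
        pruned[s] = {t: p / total for t, p in kept.items()} if total else {}
    return pruned


start = ergodic_transitions(["Contents", "Editorial", "Receipts"])

# Hypothetical probabilities after training, invented for the example.
trained = {"Contents": {"Contents": 0.0004, "Editorial": 0.9991,
                        "Receipts": 0.0005},
           "Editorial": {"Contents": 0.0005, "Editorial": 0.9,
                         "Receipts": 0.0995},
           "Receipts": {"Contents": 0.0, "Editorial": 0.0002,
                        "Receipts": 0.9998}}
sparse = prune_transitions(trained)
```

After pruning, the toy model keeps only the edges Contents to Editorial, Editorial to Editorial, Editorial to Receipts, and Receipts to Receipts, so a fully connected model ends up with a sparse, grammar-like topology.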

Table 3 shows detailed results of the experiments.
Classification accuracy is shown for single classes and for
the entire test set. As we can see, EM when completely
uninformed



[Plot: model performance on single sequences (merging
algorithm), "American Missionary (1885-1893)".]

Figure 7: Performance of the hybrid model on single
sequences (merging algorithm).





                          % of labeled documents
Category          0       30      50      70      90      100
--------------  ------  ------  ------  ------  ------  ------
Contents         0      100     100     100     100     100
Editorial       20.76    21.12   59.67   58.6    67.62   71.41
South            1.51    83.58   69.73   84.94   84.34   84.19
Indians         10.07     0      55.03   51.68   50.34   58.39
Chinese          0       27.45   83.66   76.47   75.82   75.16
Bur.W.W.         0       43.22   63.74   63      64.84   65.93
Child. P.        4.35    78.26   73.91   58.7    78.27   76.09
Commun.          0       91.4    91.4    93.55   93.55   93.55
Receipts         0       89.27   98.68   97.36   98.31   98.31
Advert.         81.4     69.77   93.02   90.7    90.7    90.7
Total Accuracy   8.23    55.66   73.54   75.66   78.7    80.24

Table 3: Results achieved by the model trained by
Expectation-Maximization, varying the percentage of labeled
documents.

(0% evidence) performs worse than random guessing (8.23%
accuracy, against an expected 10% for uniform guessing over
ten classes). With 50% of labeled documents, the model
outperforms Naive Bayes (73.54% against 72.57%). This is a
positive result, because the Naive Bayes training phase (in
the standard formulation) needs all document labels, while in
this setting we simulate the knowledge of only half of them.
With greater percentages of labeled documents, performance
begins to saturate, reaching a maximum of 80.24% when all
the labels are known. This result is worse than the 85.28%
obtained with the first strategy (see Section 4.3.1). The
main difference is that in this case we started training from
an ergodic model and used one state per class. This confirms
that in the case of completely labeled documents it is
advantageous to use more states per class and to use the
data-driven algorithm for structure selection.

5. CONCLUSIONS 

We have presented a text categorization system for
multi-page documents which is capable of effectively taking
into account contextual information to improve accuracy with
respect to traditional isolated page classifiers. Our method
can smoothly deal with unlabeled pages within a document,
although we have found that learning the HMM structure
further improves performance compared to starting from an
ergodic structure. The system uses OCR-extracted words
as features. Clearly, richer page descriptions could be
integrated in order to further improve performance. For
example, optical recognizers output information about the
font, size, and position of text, which may help
discriminate between classes. Moreover, OCR text is noisy,
and another direction for improvement is to include more
sophisticated feature selection methods, such as
morphological analysis or the use of n-grams [4, 14].

Another aspect is the granularity of the document structure
being exploited. Working at the level of pages is
straightforward since page boundaries are readily available.
However, actual category boundaries may not coincide with
page boundaries, and some pages contain portions of text
related to different categories. Although this is not very
critical for single-column journals such as the American
Missionary, the case of documents typeset in two or three
columns certainly deserves attention. A further direction of
investigation is therefore the development of algorithms
capable of automatically segmenting a continuous stream of
text, without necessarily relying on page boundaries.

The categorization method presented in this paper is tar- 
geted to textual information. However, the same hybrid 
HMM methodology could be applied for classification of 
pages based on layout information, provided an adequate 
emission model is available. A suitable generative model for 
document layout is presented in [8]. 

Finally, categorization algorithms that include contextual
information may be very useful for other types of documents
natively available in electronic form. For example, the
categorization of web pages may take advantage of the
contents of neighboring pages (as defined by the hyperlink
structure of the web).